3/9/23
Q: I am wondering when can we get the feedback for cs01 since it can help us to make cs02 better.
A: Great question! Grading is underway. We plan to finish Sunday night and release Monday of week 10, so that you can incorporate feedback for cs02.
Q: Why is a video mentioned in assignments for the Final Project?
A: All groups will have to do a presentation. This can be recorded (video) or presented live Thursday of finals week. See description here.
Q: The most confusing part of the lecture was why did we use the histogram plot instead of a bar plot when plotting the variables for case study 2.
A: Hmm….we used a bar plot and not a histogram when doing the plotting. For theskimoutputs (if that’s what you’re asking), it will generate a histogram in line for any numeric type variables. If I’m misunderstanding your question, please do follow up!
Notes:
git:
…will only work if you finished data set of notes (or open this week’s lab).
nyts_data |>
group_by(year) |>
summarize("Ever \n (any lifetime use)" = (mean(tobacco_ever, na.rm = TRUE) * 100),
"Current \n (any past-30-day use)" = (mean(tobacco_current, na.rm = TRUE) * 100)) |>
pivot_longer(cols = -year, names_to = "User", values_to = "Percentage of students") |>
ggplot(aes(x = year, y = `Percentage of students`)) +
geom_line(aes(linetype = User)) +
geom_point(show.legend = FALSE, size = 2) +
# this allows us to choose what type of line we want for each line
scale_linetype_manual(values = c(1, 2),
breaks = c("Ever \n (any lifetime use)",
"Current \n (any past-30-day use)")) +
# this allows us to specify how the y-axis should appear
scale_y_continuous(breaks = seq(0, 70, by = 10),
labels = seq(0, 70, by = 10),
limits = c(0, 70)) +
# this adjusts the background style of the plot
theme_linedraw() +
labs(title = "How has tobacco use varied over the years?",
y = "% of students") +
# this moves the legend to the bottom of the plot and removes the x axis title
theme(legend.position = "bottom",
axis.title.x = element_blank(),
text = element_text(size = 15),
plot.title.position = "plot")nyts_data |>
group_by(year) |>
summarize("Ever \n (any lifetime use)" = (mean(ecig_ever, na.rm = TRUE) * 100),
"Current \n (any past-30-day use)" = (mean(ecig_current, na.rm = TRUE) * 100)) |>
pivot_longer(cols = -year, names_to = "User", values_to = "Percentage of students") |>
ggplot(aes(x = year, y = `Percentage of students`)) +
geom_line(aes(linetype = User)) +
geom_point(show.legend = FALSE, size = 2) +
# this allows us to choose what type of line we want for each line
scale_linetype_manual(values = c(1, 2),
breaks = c("Ever \n (any lifetime use)",
"Current \n (any past-30-day use)")) +
# this allows us to specify how the y-axis should appear
scale_y_continuous(breaks = seq(0, 60, by = 10),
labels = seq(0, 60, by = 10),
limits = c(0, 60)) +
# this adjusts the background style of the plot
theme_linedraw() +
labs(title = "How has e-cigarette use varied over the years?",
y = "% of students") +
# this moves the legend to the bottom of the plot and removes the x axis title
theme(legend.position = "bottom",
axis.title.x = element_blank(),
text = element_text(size = 15),
plot.title.position = "plot")v_colors = viridis(5)[1:4] #specify color palatte
nyts_data |>
group_by(Group, year, n) |>
summarize(group_count = n()) |>
mutate("Percentage of students" = group_count / n * 100) |>
ggplot(aes(x = year, y = `Percentage of students`, color = Group)) +
geom_point(size = 2) +
geom_line() +
scale_color_manual(breaks = c("Neither", "Combination of products",
"Only e-cigarettes", "Only other products"),
values = v_colors) +
theme_linedraw() +
labs(x = "Year") +
theme(text = element_text(size = 15),
plot.title.position = "plot")How has tobacco and e-cigarette/vaping use by American youths changed since 2015?
❓ Your Turn: Given the results thus far, what’s your answer? Anything else you’d explore?
v_colors = viridis(6)[c(3, 5)]
nyts_data |>
filter(!is.na(Sex)) |>
group_by(year, Sex) |>
summarize("Ever \n (any lifetime use)" = (mean(EELCIGT, na.rm = TRUE) * 100),
"Current \n (any past-30-day use)" = (mean(CELCIGT, na.rm = TRUE) * 100)) |>
pivot_longer(cols = "Ever \n (any lifetime use)":"Current \n (any past-30-day use)",
names_to = "User",
values_to = "Percentage of students") |>
ggplot(aes(x = year, y = `Percentage of students`, color = Sex)) +
geom_line(aes(linetype = User)) +
geom_point(show.legend = FALSE, size = 2) +
scale_linetype_manual(values = c(2, 1)) +
scale_color_manual(values = v_colors) +
theme_linedraw() +
labs(title = "How does e-cigarette usage compare between males and females?",
subtitle = "Current and ever users by sex",
y = "% of students") +
theme(legend.position = "bottom",
axis.title.x = element_blank(),
text = element_text(size = 15),
plot.title.position = "plot")?## Q2: What’s the answer?
How does e-cigarette use compare between males and females?
❓ Your Turn: Given the results thus far, what’s your answer? Anything else you’d explore?
nyts_data |>
filter(year != 2015) |>
group_by(year) |>
summarize(Menthol = (mean(menthol) * 100),
`Clove or Spice` = (mean(clove_spice) * 100),
Fruit = (mean(fruit) * 100),
Chocolate = (mean(chocolate) * 100),
`Alcoholic Drink` = (mean(alcoholic_drink) * 100),
`Candy/Desserts/Sweets` = (mean(candy_dessert_sweets) * 100),
Other = (mean(other) * 100)) |>
pivot_longer(cols = -year,
names_to = "Flavor",
values_to = "Percentage of students") |>
rename(Year = year) |>
ggplot(aes(y = `Percentage of students`,
x = Year,
fill = reorder(Flavor, `Percentage of students`))) +
geom_bar(stat = "identity",
position = "dodge",
color = "black") +
scale_fill_viridis(discrete = TRUE) +
theme_linedraw() +
guides(fill = guide_legend("Flavor")) +
labs(title = "What flavors appear to be used the most frequently?",
subtitle = "Flavors of tobacco products used in the past 30 days") + theme(text = element_text(size = 15))What vaping brands and flavors appear to be used the most frequently?
❓ Your Turn: Given the results thus far, what’s your answer? Anything else you’d explore?
v_colors = viridis(6)[c(1, 4)]
nyts_data |>
group_by(year) |>
summarize(
"Cigarettes, Ever \n (any lifetime use)" = (mean(ECIGT, na.rm = TRUE) * 100),
"E-cigarettes, Ever \n (any lifetime use)" = (mean(EELCIGT, na.rm = TRUE) * 100),
"Cigarettes, Current \n (any past-30-day use)" = (mean(CCIGT, na.rm = TRUE) * 100),
"E-cigarettes, Current \n (any past-30-day use)" = (mean(CELCIGT, na.rm = TRUE) * 100)
) |>
pivot_longer(cols = -year,
names_to = "Category",
values_to = "Percentage of students") |>
separate(Category, into = c("Product", "User"), sep = ", ") |>
ggplot(aes(
x = year,
y = `Percentage of students`,
color = Product,
linetype = User
)) +
geom_line() +
geom_point(show.legend = FALSE, size = 2) +
scale_linetype_manual(values = c(2, 1)) +
scale_color_manual(values = v_colors) +
theme_linedraw() +
labs(title = "How does e-cigarette use compare to cigarette use?",
subtitle = "Current and ever users of e-cigarettes and cigarettes",
y = "% of students") +
theme(legend.position = "bottom",
axis.title.x = element_blank(),
text = element_text(size = 15),
plot.title.position = "plot")Is there a relationship between e-cigarette/vaping use and other tobacco use?
❓ Your Turn: Given the results thus far, what’s your answer? Anything else you’d explore?
\[\frac{\text{actual proportion of group in the population}}{\text{ proportion of group in the respondents}}\]
Note: We do not need to calculate survey weights for our data; they were already supplied in the dataset, as described in the codebooks.
psu: People are sampled within strata; ensure that the responses are representative of the population of interest. Thus, often people first think about ensuring that surveys are conducted in a variety of geographical areas. This is often called the primary sampling unit or PSU. In this survey, the county where the student’s school was located was used as the PSU.stratum: A categorical variable that indicates subsets of the data that include respondents from different PSUs. In our case, strata are determined by the predominant minority in the PSU (Non-Hispanic Black or Hispanic), whether the PSU is urban or non-urban, and what percent of the students in the PSU fall into the predominant minority group. PSUs are allocated across the 16 possible strata according to the sampling scheme. These strata values allow estimates based on the survey responses to be calculated using different strata allowing for improved precision of the response estimates.finwgt: The survey weight which was calculated based on a variety of factors.srvyrsrvyr::as_survey_design(): create a survey object with a specified survey design. Includes information about how the survey was conducted that can be taken into account in the analysis.
strata - the variable(s) that defined strata in the data. (stratum)ids - cluster ids within the data. (psu)weight - which variable(s) are the survey weights. (finwgt)nest = TRUE - forces cluster ids (in this case the PSU) to be nested within the strataNote: survey weights are specific to a single year of the survey result; will need different design objects for each year
surveyMeanA()surveyMeanA <- function(currYear) {
options(survey.lonely.psu = "adjust")
currYear |>
as_survey_design(strata = stratum,
ids = psu,
weight = finwgt,
nest = TRUE) |>
summarize(tobacco_ever = survey_mean(tobacco_ever,
vartype = "ci",
na.rm = TRUE),
tobacco_current = survey_mean(tobacco_current,
vartype = "ci",
na.rm = TRUE)) |>
mutate_all("*", 100) |>
pivot_longer(everything(),
names_to = "Type",
values_to = "Percentage of students") |>
mutate(Estimate = case_when(str_detect(Type, "_low") ~ "Lower",
str_detect(Type, "_upp") ~ "Upper",
TRUE ~ "Mean"),
User = case_when(str_detect(Type, "ever") ~ "Ever",
str_detect(Type, "current") ~ "Current",
TRUE ~ "Mean"))}❓ What’s this accomplishing?
nyts_data |>
group_by(year) |>
group_modify(~ surveyMeanA(.x)) |>
dplyr::select(-Type) |>
pivot_wider(names_from = Estimate,
values_from = `Percentage of students`) |>
ggplot(aes(x = year, y = Mean)) +
geom_line(aes(linetype = User)) +
geom_linerange(aes(ymin = Lower,
ymax = Upper),
linewidth = 1,
show.legend = FALSE) +
scale_linetype_manual(values = c(2, 1)) +
scale_y_continuous(breaks = seq(0, 70, by = 10),
labels = seq(0, 70, by = 10),
limits = c(0, 70)) +
theme_linedraw() +
labs(title = "Tobacco product users more prevalent after 2017",
y = "% of students") +
theme(legend.position = "none",
axis.title.x = element_blank(),
text = element_text(size = 15),
plot.title.position = "plot")\[logit(p)= log_e (\frac{p}{1-p})= \beta_0 + \beta_1 X\]
dat2015 <- nyts_data |>
filter(year == 2015, !is.na(Sex))
currEcigSex <- logistic_reg() |>
set_engine("glm") |>
fit(as.factor(ecig_current) ~ Sex, data = dat2015, family = "binomial")
(currEcigSexTidy <- tidy(currEcigSex))# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -2.32 0.0377 -61.5 0
2 Sexmale 0.425 0.0490 8.66 4.73e-18
\[ \begin{aligned} log(odds \ of \ curr. \ e-cig \ use) = \\ &\quad \beta_0 + \beta_1 \cdot Sex = \\ &\quad -1.89 - 0.425 \cdot (Sex == female) \end{aligned} \]
\[ \begin{aligned} log(odds \ of \ curr. \ e-cig \ use) = \\ &\quad \beta_0 + \beta_1 \cdot Sex = \\ &\quad -1.89 - 0.425 \cdot (Sex == female) \end{aligned} \]
Note: model does not consider other variables yet…i.e. Age |
dat2015_survey_design <- dat2015 |>
as_survey_design(strata = stratum,
ids = psu,
weight = finwgt,
nest = TRUE)
currEcigSex_svy <- survey::svyglm(ecig_current ~ Sex,
family = quasibinomial(link = 'logit'),
design = dat2015_survey_design)
tidy(currEcigSex_svy)# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -2.28 0.0683 -33.4 1.15e-42
2 Sexmale 0.383 0.0700 5.48 7.52e- 7
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -2.28 0.0683 -33.4 1.15e-42
2 Sexmale 0.383 0.0700 5.48 7.52e- 7
Q4: Is there a relationship between e-cigarette/vaping use and other tobacco use?
❓ Your Turn: What would we likely model given what we’ve discussed so far?